Chapter 5 Tokenization
Tokenization refers to the process of segmenting a long piece of discourse into smaller linguistic units. Depending on your purposes, these linguistic units may be of several kinds:
- paragraphs
- sentences
- words
- syllables/characters
- letters
- phonemes
In this chapter, we are going to look at this issue in more detail. Specifically, we will discuss the idea of word co-occurrence, which is one of the most fundamental methods in corpus linguistics, and relate it to the issue of tokenization.
5.1 English Tokenization
To get a clearer idea of how tokenization works in unnest_tokens, we first create a simple corpus x in a tidy structure, i.e., a tibble, with one text only.
x <- tibble(id = 1, text = "'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL. Soup does very\n well without--Maybe it's always pepper that makes people hot-tempered,'...")
x
## 'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
## though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
## well without--Maybe it's always pepper that makes people hot-tempered,'...
(Please note that there are two line breaks in the text.)
If we use the default setting token = "words" in unnest_tokens, we will get:
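For instance, a minimal call looks like this (the output column name word is our choice; the toy corpus x from above is repeated here so the snippet runs on its own):

```r
library(dplyr)
library(tidytext)

# the one-row toy corpus defined above
x <- tibble(id = 1, text = "'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL. Soup does very\n well without--Maybe it's always pepper that makes people hot-tempered,'...")

# default word tokenizer: lowercases the text and strips punctuation
x_words <- x %>%
  unnest_tokens(word, text)
head(x_words$word)
```

Note that each row of the result holds exactly one token, which is what makes the output "tidy" and easy to count with dplyr verbs.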
5.2 Text Analytics Pipeline
5.3 Proper Units for Analysis
5.3.1 Sentence Tokenization
In text analytics, what we often do first is sentence tokenization.
class(x)
## [1] "tbl_df" "tbl" "data.frame"
In unnest_tokens of the tidytext library, you can specify the parameter token to customize the tokenizing function. As this library is designed to deal with English texts, there are several built-in options for English text tokenization, including words (the default), characters, character_shingles, ngrams, skip_ngrams, sentences, lines, paragraphs, regex, and ptb (Penn Treebank).
Sometimes it is good to give each sentence of the document an index, e.g., ID, which can help us easily keep track of the relative position of the sentence in the original document.
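As a sketch, such an index can be added with row_number() right after tokenizing by sentences (the tibble x_demo and the column name sentence_id below are our own illustrative choices):

```r
library(dplyr)
library(tidytext)

# a hypothetical one-document corpus for illustration
x_demo <- tibble(id = 1,
                 text = "I won't have any pepper in my kitchen. Soup does very well without. Maybe it's pepper that makes people hot-tempered.")

# tokenize by sentences, then attach a running sentence index
x_sent <- x_demo %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number())
x_sent
```

With sentence_id in place, later word-level tokenization can be grouped back to the sentence each word came from.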
5.3.2 Word Tokenization
Corpus linguistics deals with words all the time. Word tokenization is therefore the most frequently used method for segmenting texts. This is not a big concern for languages like English, which usually place whitespace between words.
Please note that by default, token = "words" normalizes the text to lowercase. Also, all non-word tokens are automatically removed. If you would like to preserve the casing differences and the punctuation, you can include the following arguments: unnest_tokens(…, token = "words", to_lower = FALSE, strip_punct = FALSE, strip_numeric = FALSE).
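A quick sketch of the difference these arguments make (x_demo is a hypothetical one-line corpus; strip_punct and strip_numeric are passed through to the underlying word tokenizer):

```r
library(dplyr)
library(tidytext)

# a hypothetical toy corpus with casing and punctuation worth preserving
x_demo <- tibble(id = 1, text = "Soup does very well without pepper, AT ALL!")

# keep the original casing and keep punctuation marks as tokens
x_raw <- x_demo %>%
  unnest_tokens(word, text,
                token = "words",
                to_lower = FALSE,
                strip_punct = FALSE)
x_raw
```

In the result, "AT" and "ALL" keep their capitals, and the comma and exclamation mark each appear as tokens of their own.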
5.4 Lexical Bundles (n-grams)
Sometimes it is helpful to identify frequently occurring n-grams, i.e., recurrent multi-word sequences. You can easily create an n-gram frequency list using unnest_tokens():
corp_us_trigram <- corp_us_df %>%
unnest_tokens(trigrams, text, token = "ngrams", n = 3)
corp_us_trigram
We can then examine which n-grams were most often used by each President:
Exercise 5.3 Please subset the top 3 trigrams of Presidents Donald Trump, Bill Clinton, and John Adams from corp_us_trigram.
When looking at frequency lists, there is another distributional metric we need to consider: dispersion. An n-gram is likely to be meaningful if its frequency is high. However, this high frequency may arise in different ways. What if the n-gram occurs in only ONE particular document, i.e., is used only by a particular President? Or alternatively, what if the n-gram appears in many different documents, i.e., is used by different Presidents?
So now let’s compute the dispersion of the n-grams in our corp_us_df. Here we define the dispersion of an n-gram as the number of documents where it occurs.
corp_us_trigram %>%
count(trigrams, President) %>%
group_by(trigrams) %>%
summarize(freq = sum(n), dispersion = n()) %>%
arrange(desc(dispersion))

# corp_us_trigram %>%
#   count(trigrams, President) %>%
#   group_by(trigrams) %>%
#   summarize(freq = sum(n), dispersion = n()) %>%
#   arrange(desc(freq))
Therefore, lexical bundles or n-grams are usually defined based on the distributional patterns of these multi-word units. In particular, cut-off values are often determined to select a list of meaningful lexical bundles. These cut-off values include the frequency of the n-grams as well as their dispersion.
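As a sketch of how such cut-offs might be applied, the filter below keeps only bundles above illustrative thresholds (the trigram_stats table is toy data standing in for the corpus summary above, and the cut-off values freq >= 10 and dispersion >= 3 are arbitrary choices, not recommendations):

```r
library(dplyr)

# toy frequency/dispersion table standing in for the real corpus summary;
# the cut-off values below are illustrative only
trigram_stats <- tibble(
  trigrams   = c("of the united", "the united states",
                 "my fellow citizens", "so help me"),
  freq       = c(42, 40, 12, 3),
  dispersion = c(30, 29, 8, 2)
)

# keep bundles that are both frequent and widely dispersed
bundles <- trigram_stats %>%
  filter(freq >= 10, dispersion >= 3) %>%
  arrange(desc(freq))
bundles
```

In practice, the thresholds would be chosen with the corpus size and the research question in mind.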